# Unconventional Computing Architectures with Reconfigurable Devices in the Cloud

Michaela Blott, Distinguished Engineer, Jan. 2019



Lucian Petrica, Giulio Gambardella, Alessandro Pappalardo, Ken O'Brien, me, Nick Fraser, Yaman Umuroglu (from left to right)



# Agenda

- Background
- Industry Context
- Unconventional Computing Architectures





# Background



© Copyright 2018 Xilinx

### Xilinx Research - Ireland

> Part of the worldwide CTO organization (9 out of 36)

- Including Xilinx University Program (Cathal, Katie)
- AI Lab expansion part-financed through IDA Ireland

> Mission: Application-driven technology development





Kees Vissers, Fellow





# **Plus a Very Active Internship Program**

#### > On average 4-6 interns at any given time

- >> From top universities all over the world
- >> We are always looking for talent ;-)

#### > Overall

- >> 70+ interns since 2007
- >> Many collaborations have come from this
- >> Many found employment





# **Industry Context**



# "Trends meeting Technological Reality"



### Mega-Trend: The Rise of the Machine (Learning Algorithm)



# What's the Challenge? Example: Convolutional Neural Networks *Forward Pass (Inference)*



Basic arithmetic, incredibly parallel, but huge compute and memory requirements
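To make "basic arithmetic, but huge compute and memory" concrete, here is a minimal sketch that counts the multiply-accumulate operations (MACs) and weights of a single convolutional layer. The layer dimensions are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
def conv_layer_cost(in_ch, out_ch, kernel, out_h, out_w):
    """Return (MACs, weight count) for one convolutional layer, ignoring bias."""
    weights = out_ch * in_ch * kernel * kernel
    # Every weight is applied once per output pixel.
    macs = weights * out_h * out_w
    return macs, weights


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 output feature map.
macs, weights = conv_layer_cost(in_ch=64, out_ch=128, kernel=3, out_h=56, out_w=56)
print(f"{macs:,} MACs and {weights:,} weights for a single layer")
```

Each MAC is just a multiply and an add, and all of them within a layer are independent, which is where the parallelism comes from. A full network repeats this for every layer.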


# **Compute and Memory for Inference**



### Mega-Trend: Explosion of Data

#### > Astronomically growing amounts of data

- >> More sensors
- >> More users
- >> More use cases: Genomics (DNA) "Genomical"





Stephens, Zachary D., et al. "Big data: astronomical or genomical?."

### Technology: End of Moore's Law & Dennard Scaling







#### **Economics become questionable**

#### Power dissipation becomes problematic



# Era of Heterogeneous Compute using Accelerators



- > Diversification of increasingly heterogeneous devices and systems
  - Moving away from standard von Neumann architectures

#### > True Architectural innovation & Unconventional Computing Systems



# **Evidence: Heterogeneous Data Centers**



Insight 2016: AWS adding FPGA instances




# Unconventional at System Level: Diversification with Accelerator Support



#### > With accelerators moving closer to the CPU (OpenCAPI, CCIX, etc.)


# **Evidence: Heterogeneous Devices**




#### > From the Xilinx World: Evolution of FPGAs to ACAPs


# With reconfigurable computing, we can go even more unconventional: some examples

Key-Value Stores
- customized data paths
- customized memory subsystem



# Key-Value Stores - Background



# **Current Implementations**

#### > Multithreaded implementation (pthreads)

- >> Each request is a connection
- >> All threads execute drive\_machine(), which switches on the connection state and moves each connection from one state to the next
- >> Shared data structures (hash tables, value store, ...)

#### > Bottlenecked by:

- >> Synchronization overhead
  - Threads stall on memory locks, serializing execution on x86
- >> TCP/IP processing is CPU-intensive and interrupt-intensive, and its code is too large to fit into the instruction cache
- >> Last-level cache is ineffective due to the random-access nature of the application (miss rate 60% to 95% on x86)

#### > Performance significantly below 10 Gbps line rate

- Intel Xeon (8 cores): 1.34 MRPS, 200-300 µs, 7 KRPS/Watt

Receive & parse → Hash lookup → Value store access → Format & transmit

```
drive_machine():
while (!stop) {
   switch (c->state) {
     case connection_waiting:
     case connection_closing:
     ...
     case new_command:
       lock socket;
       read from socket;
       unlock socket;
       parse;
     case read_htable:
       hash key;
       lock hash table;
       hash table access;
       hash table LRU;
       unlock hash table;
     case write_output:
       ...
   }
}
```
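The same four steps (receive & parse, hash lookup, value store access, format & transmit) can also be arranged as a chain of independent stages that hand data forward instead of sharing locked state. The following is a toy sketch of that structure only. It assumes a made-up text protocol (`set <key> <value>` and `get <key>`), and all function names are hypothetical; it is neither memcached's protocol nor the FPGA design.

```python
def parse(requests):
    """Stage 1: receive and parse a raw request line into (op, key, value)."""
    for line in requests:
        op, key, *rest = line.split(maxsplit=2)
        yield op, key, (rest[0] if rest else None)


def hash_lookup(parsed, num_buckets=1024):
    """Stage 2: map each key to a bucket index."""
    for op, key, value in parsed:
        yield op, hash(key) % num_buckets, key, value


def value_store(looked_up, store):
    """Stage 3: read from or write to the value store."""
    for op, bucket, key, value in looked_up:
        if op == "set":
            store[(bucket, key)] = value
            yield "STORED"
        else:
            yield store.get((bucket, key))


def format_response(results):
    """Stage 4: format the response for transmission."""
    for result in results:
        yield "NOT_FOUND" if result is None else str(result)


def run_pipeline(requests):
    store = {}  # owned by the value-store stage only, so no locks are needed
    stages = format_response(value_store(hash_lookup(parse(requests)), store))
    return list(stages)


print(run_pipeline(["set a 1", "get a", "get b"]))  # ['STORED', '1', 'NOT_FOUND']
```

Python generators run one request at a time, so this shows only the structure. In a hardware dataflow pipeline the stages operate concurrently on different requests.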

# **Dataflow Architectures to Scale Performance**



> Order of magnitude improvement in latency and best in class for jitter

#### > 10Gbps demonstrated with a 64b data path @ 156MHz using 3% of FPGA resources

Source: [4] Blott et al: Achieving 10Gbps line-rate key-value stores with FPGAs; HotCloud 2013
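The 10 Gbps figure follows directly from the data path width and clock: a pipeline that accepts one word per cycle has a peak throughput of width times frequency. A quick check, assuming the "156MHz" on the slide is the rounded 156.25 MHz clock commonly used for 10 Gigabit Ethernet data paths (that exact value is an assumption, not stated in the slides):

```python
def datapath_throughput_gbps(width_bits, clock_mhz):
    """Peak throughput of a data path that moves width_bits on every clock cycle."""
    return width_bits * clock_mhz * 1e6 / 1e9


for clock_mhz in (156.0, 156.25):
    gbps = datapath_throughput_gbps(64, clock_mhz)
    print(f"64b @ {clock_mhz} MHz -> {gbps:.3f} Gbps")
```

The point of the dataflow design is that this peak is sustained: every stage accepts a new word each cycle, so throughput does not depend on locks or cache behavior.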







Deep Learning
- customized precision arithmetic



### *Further unconventional at the micro-architecture level:* from floating point to reduced-precision neural networks




# **Reducing Precision** *Scales Performance & Reduces Memory*

#### > Reducing precision shrinks LUT cost

>> Instantiate **100x** more compute within the same fabric

#### > Potential to reduce memory footprint

>> NN model can stay on-chip => no memory bottlenecks

| Precision | Model size [MB] (ResNet50) |
|-----------|----------------------------|
| 1b        | 3.2                        |
| 8b        | 25.5                       |
| 32b       | 102.5                      |
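The table follows from a simple relation: model size is the number of parameters times the bits per parameter. A minimal sketch, assuming a parameter count of 25.5 million inferred from the 8b row (one byte per parameter); that count is an assumption for illustration, not a figure given in the slides.

```python
# Assumed parameter count, inferred from the 8b row of the table (1 byte each).
RESNET50_PARAMS = 25.5e6


def model_size_mb(num_params, bits_per_param):
    """Storage needed for the parameters alone, in megabytes (10^6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


for bits in (1, 8, 32):
    print(f"{bits:>2}b: {model_size_mb(RESNET50_PARAMS, bits):6.1f} MB")
```

The 1b and 8b results match the table. The 32b result comes out slightly lower than the table's 102.5 MB, presumably because of rounding or how parameters are counted.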




# **Reducing Precision Inherently Saves Power**

**FPGA:** 



Target device: ZU7EV • Ambient temperature: 25 °C • Toggle rate: 12.5% • Static probability: 0.5 • Power reported for the PL-accelerated block only

**ASIC:** 

| Operation           | Energy (pJ) |
|---------------------|-------------|
| 8b Add              | 0.03        |
| 16b Add             | 0.05        |
| 32b Add             | 0.1         |
| 16b FP Add          | 0.4         |
| 32b FP Add          | 0.9         |
| 8b Mult             | 0.2         |
| 32b Mult            | 3.1         |
| 16b FP Mult         | 1.1         |
| 32b FP Mult         | 3.7         |
| 32b SRAM Read (8KB) | 5           |
| 32b DRAM Read       | 640         |

(Chart: relative energy cost per operation, logarithmic axis.)

Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, F…
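A few ratios from the energy table show why precision and on-chip storage matter. This sketch treats one multiply-accumulate as one multiply plus one add of the listed widths, which is a simplifying assumption (a real 8b MAC typically accumulates into a wider register).

```python
# Energy per operation in pJ, copied from the table above.
ENERGY_PJ = {
    "8b add": 0.03,
    "32b FP add": 0.9,
    "8b mult": 0.2,
    "32b FP mult": 3.7,
    "32b SRAM read": 5.0,
    "32b DRAM read": 640.0,
}


def mac_energy_pj(mult, add):
    """Assumption: one MAC costs one multiply plus one add of the given kinds."""
    return ENERGY_PJ[mult] + ENERGY_PJ[add]


int8_mac = mac_energy_pj("8b mult", "8b add")
fp32_mac = mac_energy_pj("32b FP mult", "32b FP add")
dram_vs_sram = ENERGY_PJ["32b DRAM read"] / ENERGY_PJ["32b SRAM read"]

print(f"8b MAC:   {int8_mac:.2f} pJ")
print(f"FP32 MAC: {fp32_mac:.2f} pJ ({fp32_mac / int8_mac:.0f}x the 8b MAC)")
print(f"DRAM read vs SRAM read: {dram_vs_sram:.0f}x")
```

Under these assumptions, arithmetic precision changes the compute energy by more than an order of magnitude, and a single off-chip read costs far more than the arithmetic it feeds.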



Rybalkin, V., Pappalardo, A., Ghaffar, M.M., Gambardella, G., Wehn, N. and Blott, M. "FINN-L: Library Extensions and Design Tradeoff Analysis for Variable Precision LSTM Networks on FPGAs."


# **Design Space Trade-Offs**



### Even More Unconventional: *Bit-Parallel vs Bit-Serial*

> Furthermore, bit-serial computation can provide run-time programmable precision with a fixed architecture



> FPGA: Flexibility comes at almost no cost and provides equivalent bit-level performance at chip level for low precision\*

\* Umuroglu, Rasnayake, Sjalander. "BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing." FPL 2018. https://arxiv.org/pdf/1806.08862.pdf
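The idea behind bit-serial matrix multiplication can be sketched in a few lines: split each operand into bit planes, multiply the resulting 0/1 matrices (AND followed by popcount), and sum the partial products weighted by powers of two. This is a plain-Python illustration for unsigned operands only, with hypothetical function names; it is not the BISMO implementation.

```python
def bit_plane(matrix, bit):
    """Extract one bit position from every element, giving a 0/1 matrix."""
    return [[(x >> bit) & 1 for x in row] for row in matrix]


def binary_matmul(a, b):
    """Product of two 0/1 matrices: each dot product is AND followed by popcount."""
    cols = list(zip(*b))
    return [[sum(x & y for x, y in zip(row, col)) for col in cols] for row in a]


def bit_serial_matmul(a, b, a_bits, b_bits):
    """Unsigned integer matrix product built from weighted binary matrix products."""
    rows, cols = len(a), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(a_bits):
        a_plane = bit_plane(a, i)
        for j in range(b_bits):
            partial = binary_matmul(a_plane, bit_plane(b, j))
            for r in range(rows):
                for c in range(cols):
                    result[r][c] += partial[r][c] << (i + j)
    return result


a = [[3, 1], [2, 0]]
b = [[1, 2], [3, 1]]
print(bit_serial_matmul(a, b, a_bits=2, b_bits=2))  # [[6, 7], [2, 4]]
```

Precision appears only as the loop bounds `a_bits` and `b_bits`, while the binary matrix product stays the same. That is the sense in which a fixed architecture can offer run-time programmable precision, at the price of more passes for wider operands.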



# Summary

- Unconventional computing architectures emerge at data center, system and device level
- With reconfigurable computing we can go even more unconventional
- Leveraging customized dataflow architectures and memory subsystems, custom precisions
  - To provide dramatic performance scaling and energy efficiency benefits
  - To enable new exciting trade-offs within the design space


# Future Challenges

- Programming unconventional systems
- Benchmarking heterogeneous systems for specific applications
  - That are programmed in fundamentally different ways
  - That exploit different points within the design space
- How can you apply some of these concepts to other applications?



# THANK YOU!

# Adaptable. Intelligent.



#### More information can be found at: http://www.pynq.io/ml


